Dating ancient splits in phylogenetic trees, with application to the human-Neanderthal split

Background We tackle the problem of estimating species TMRCAs (Time to Most Recent Common Ancestor), given a genome sequence from each species and a large known phylogenetic tree with a known structure (typically from one of the species). The number of transitions at each site from the first sequence to the other is assumed to be Poisson distributed, and only the parity of the number of transitions is observed. The detailed phylogenetic tree contains information about the transition rates in each site. We use this formulation to develop and analyze multiple estimators of the species’ TMRCA. To test our methods, we use mtDNA substitution statistics from the well-established Phylotree as a baseline for data simulation such that the substitution rate per site mimics the real-world observed rates. Results We evaluate our methods using simulated data and compare them to the Bayesian optimizing software BEAST2, showing that our proposed estimators are accurate for a wide range of TMRCAs and significantly outperform BEAST2. We then apply the proposed estimators on Neanderthal, Denisovan, and Chimpanzee mtDNA genomes to better estimate their TMRCA with modern humans and find that their TMRCA is substantially later, compared to values cited recently in the literature. Conclusions Our methods utilize the transition statistics from the entire known human mtDNA phylogenetic tree (Phylotree), eliminating the requirement to reconstruct a tree encompassing the specific sequences of interest. Moreover, they demonstrate notable improvement in both running speed and accuracy compared to BEAST2, particularly for earlier TMRCAs like the human-Chimpanzee split. Our results date the human – Neanderthal TMRCA to be \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sim 408,000$$\end{document}∼408,000 years ago, considerably later than values cited in other recent studies. Supplementary Information The online version contains supplementary material available at 10.1186/s12863-023-01185-8.

1 Theoretical details

Proof of Theorem 1
Denote the Fisher information matrix for the estimation problem above by I ∈ R (n+1,n+1) , where the first n indexes correspond to {λ i } n i=1 and the last index (n + 1) corresponds to p.For clarity denote I p,p .= I n+1,n+1 , I i,p .= I i,n+1 , I p,i .= I n+1,i .Then: Consequently, an unbiased estimator p holds: Proof.We calculate the second derivative of the log-likelihood.Denote: then the first derivatives are given by: and The second derivatives are now given by: The expectation of these are given by: By CRB, for an unbiased estimator:

Proof of Proposition 1
Proof.Following Equations 12, 13, we compare the first order derivatives to 0: Summing the first equation for every i and substituting the second equation results in the last part in Equation 6.

Proof of Proposition 2
, so we can compute the likelihood of p without considering λ i .
3. The maximum likelihood estimate of p given Z i holds: and the maximum likelihood estimate of p given Proof.Denote q ≡ 1 − p.For item 1: Now moving on to item 2: And summing up these two equations leads to: Subsequently, the likelihood of Z i is given by: Taking the derivative to 0: and division by −2 1−2p yields the solution.

34
Now, according to Le Cam's theorem 1 [1], , and the likelihood is therefore: Now we look at the log-likelihood and take the derivative with respect to p to zero: Leading to the solution: More precisely:

Proof of Proposition 3
Let λ i ∼ Γ(α, β), then the maximum a posteriori estimator of p holds: Subsequently, estimated values for α, β can be substituted for a numerical estimator for p.
Proof.We first compute the probability for each observation: Hence, the likelihood is given by: and the log-likelihood: Now comparing the derivative with respect to p to zero: 2 Simulation details and additional experiments

K2P and TN93 simulations
We extracted the parameters of the rate matrix from Phylotree's data and simulated a tree in the same total branch length as Phylotree, with branches short enough to contain an average of less than one base change along the sequence (the branches' length was set as t = 1.5e − 05 when 1 is the total length of Phylotree's branches).The rate matrix at each site was scaled by the observed substitution rate λ i which is the total number of substitutions per site observed along Phylotree.For sites with λ i = 0 we used instead λ i = ϵ with ϵ chosen as explained in the following Supplementary subsection 2.2.Then, using the same rate matrix, we simulated sequences with a predefined distance p from the RSRS and assessed p using our methods.

Phylogenetic tree simulations
The rate parameter for sites with no transitions along the tree is denoted as ϵ, and we estimate it using the following simulation-based method.To generate ⃗ λ, we use the following equation: The value of ϵ is chosen to minimize the Kolmogorov-Smirnov statistic.Figure S1 shows a simulation of D(ϵ), with the mean of 1,000 runs for each ϵ value.The minimum value of D is obtained for ϵ = 0.0913 (marked in red).
To make the simulated data closer to the real data, we also model transversions.We estimate the transversion rate per site in the same manner as the transition rate, using the Kolmogorov-Smirnov statistic to account for sites with no transversions.This results in ϵ transversion = 0.0149.To determine the nucleotide at a given site, we sample whether an odd number of transversions have occurred.If so, a random nucleotide is sampled from the two available transversion options.The resulting sequence is then input into BEAST2, but our methods still use only the sites without observed transversions.Finally, the analysis is limited to the gene regions in the genome (11,341 sites).

BEAST2 run parameters
The sequences used in this work were aligned using mafft [2], and the 11.3 kb of proteincoding genes were extracted and used for the analysis.The analysis followed the approach described in [3], where the best fitting clock and tree model for the tree were identified using path sampling with the model selection package in BEAST2 [4,5,6].Each model test was run with 40 path steps, a chain length of 25 million iterations, an alpha parameter of 0.3, a pre-burn-in of 75,000 iterations, and an 80% burn-in of the entire chain.The mutation rate was set to 1.57 x 10E-8 and a normal distribution (mean: mutation rate, sigma: 1.E-10) was used for a strict clock model [7].The TN93 substitution model [8] was used for all models.The tree was calibrated with carbon dating data from ancient humans and Neanderthals, where available [9,7,10], and modern samples were set to a date of 0. All simulations were run with 4 gamma rate categories, 10,000,000 iterations, and a pre-burn-in of 1,000,000 iterations.simulation with no fixed tree topology.BEAST2 (B): BEAST2 simulation with a fixed tree topology.BEAST2 (C): BEAST2 simulation with a fixed tree topology and 50 additional human sequences.M2: Method 2 which obtained the lowest squared error in our simulations.We performed a one-sided paired Wilcoxon signed rank test on every pair of simulation variations, correcting for multiple comparisons using the Bonferroni correction.Our results show no significant improvement in the squared error between the various BEAST2 simulations.However, when comparing to Method 2 the test shows that Method 2 has the lowest squared error.

Figure S2 :Figure S2 .
Figure S2: Comparison of estimators applied on a simulated long branch without a fixed tree topology for BEAST2.

Figure S4 :Figure S4 .
Figure S4: Comparison of estimators applied on a simulated long branch with two ti/tv ratios